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Abstract 

In this article, we develop an algorithm for probabilistic and constrained projection pursuit. 
Our algorithm called ADIS (automated decomposition into sources) accepts arbitrary non-linear 
contrast functions and constraints from the user and performs non-square blind source sepa- 
ration (BSS). In the first stage, we estimate the latent dimensionality using a combination of 
bootstrap and cross validation techniques. In the second stage, we apply our state-of-the-art 
optimization algorithm to perform BSS. We validate the latent dimensionality estimation pro- 
cedure via simulations on sources with different kurtosis excess properties. Our optimization 
algorithm is benchmarked via standard benchmarks from GAMS performance library. We de- 
velop two different algorithmic frameworks for improving the quality of local solution for BSS. 
Our algorithm also outputs extensive convergence diagnostics that validate the convergence to 
an optimal solution for each extracted component. The quality of extracted sources from ADIS 
is compared to other well known algorithms such as Fixed Point ICA (FPICA), efficient Fast 
ICA (EFICA), Joint Approximate Diagonalization (JADE) and others using the ICALAB tool- 
box for algorithm comparison. In several cases, ADIS outperforms these algorithms. Finally we 
apply our algorithm to a standard functional MRI data-set as a case study. 

1 Introduction 

The Generalized Linear Model (GLM) is a popular tool for analyzing functional MRI (fMRI) 
data. GLM analysis proceeds on a voxel by voxel basis using the same design matrix. One of 
the difficulties associated with GLM analysis is the construction of an appropriate design matrix. 
Unmodeled regressors that modulate the fMRI signal in addition to the EVs but are not a part of the 
design matrix will invalidate the analysis and the associated inferences. Further, these unmodeled 
regressors might be different in different brain regions and a voxel by voxel analysis with a common 
design matrix may not be appropriate. These considerations imply that model based analyses make 
very strong assumptions which are very likely violated in a real fMRI dataset. 

Model free analysis on the other hand does not need any postulations as to the shape of the 
expected response. One such technique that has become popular in recent years, particularly for 
application to fMRI is Independent Component Analysis (ICA) |9]. See [25] for a survey on ICA. 
Popular software packages such as FSL ([35j) come with ICA software for doing non-square ICA via 
automatic latent dimensionality estimation also known as Probabilistic Independent Component 
Analysis (PICA) |2]. Essentially these techniques consists of a data reduction step using PGA 
followed by the application of standard ICA algorithms. In the signal processing community, many 
well established algorithms exist for doing ICA, the most popular ones being efficient FastICA [27J, 
Fixed Point ICA (FPICA) [26], Joint Approximate Diagonalization of Cumulant Matrices (JADE) 
[6], Extended Robust ICA (ERICA) [l3] and unbiased ICA (UNICA) [11]. In this article, we will 
question the validity of solution produced by current ICA codes. Are these solutions really optimal? 
Can we afford to pay the price for using non-optimal solutions? 

One of the challenges involved in applying ICA to real data is the confidence in the quality of 
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estimated solution. This problem has been recognized before. For example the software package 
ICASSO [21j uses bootstrapping simulations to run an ICA algorithm multiple times and then 
clusters the estimated sources to assess reliability. ICASSO takes into account the "algorithmic" 
variability and the variability in the original data induced due to sampling, but since it assumes 
square, noise free mixing it ignores the estimation errors induced due to noisy source mixing. The 
presence of non-square mixing in real data also introduces additional variability due to the unknown 
latent dimensionality of the sources. 

While a bootstrapping strategy can always be used to test the "sensitivity" of estimated BBS 
solution from any algorithm, it is critical to have a reliable and verifiable optimization solver to 
solve the non-convex BSS problem in the first place. 

Currently existing ICA codes perform optimization using techniques that have formulas for updat- 
ing the unknown variables using a Newton step, gradient descent, natural gradient |3], [1] [18] or 
similar strategies. In addition they also potentially have a number of heuristic rules for updating 
the various "learning" parameters in these algorithms. No convergence diagnostics are used to 
check the optimality of estimated solution. To the best of our knowledge, such optimization codes 
are not benchmarked using standard optimization test cases. Such ad-hoc strategies along with 
non- verified optimality may severely affect the quality of solution produced by these algorithms and 
potentially impact the practical conclusions drawn from incorrect results. The popular FastICA 
algorithm [2^ uses an approximate Newton iteration where the approximate Hessian simplification 
reduces it to a gradient descent algorithm with a fixed step size. However there is no reason to use 
these approximations. One can use state of the art optimization software to compute near exact 
step sizes with locally varying Hessian approximations. In fact, it has been shown [39] that these 
exact step size search significantly increases the estimation efficiency and robustness to initializa- 
tion in comparison to the fixed update rules of FastICA. The optimization core in our algorithm 
ADIS efficiently handles local non-convexity as well as allows for infeasible steps, i.e., steps that 
violate the constraints leading to a more fuller exploration of parameter space and increasing the 
likelihood of converging to a global optimum. 

Another issue is the modification of the default contrast function in ICA (e.g. Negentropy) to 
do other types of source extraction. It is also desirable to be able to add additional equality 
and/or inequality constraints to BSS estimation depending on the application at hand. These 
issues cannot be addressed using current ICA software. In this paper, we develop our algorithm 
ADIS and validate its various components. Finally, we compare ADIS to currently existing ICA 
codes on many standard benchmark datasets from ICALAB. 

Our algorithm ADIS (section [2| |3]) : 

1. Uses a state-of-the-art optimization algorithm at its core (inspired by LANCELOT software 
[H]) (section |4) [14]) 



2. Uses a bootstrap simulation/cross- validation based approach for latent dimensionality esti- 
mation (in case of non-square BSS) (section [6]) 
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3. Enables the user to use arbitrary contrast functions and nonlinear constraints for BSS 

4. Produces "good quality" local solutions using a special multistage framework for BSS (section 

5. Produces extensive convergence diagnostics for each extracted component to validate the 
"optimality" of the extracted source (section [4| 

We perform validation of each component of ADIS as follows: 

1. Validation of the latent dimensionality estimation procedure using simulations on sources 
with different statistical properties (section [6| 

2. Validation of our optimization core using standard benchmarks from the GAMS performance 
library ( jhttp : //www . gamsworld . org/perf ormaiice[ |15] ) (section [Tsj) 

3. We then use the "Negentropy" contrast function as a special case and compare the results 
of ADIS in terms of separation quality and robustness to other well known algorithms such 
as efficient FastICA, FPICA, JADE and others using the ICALAB toolbox |8j , ^Tj for BSS 
algorithm comparison, (section [?]) 

4. Finally we apply ADIS to real fMRI data as a case study (section [8| 



2 Probabilistic Projection Pursuit 

Projection Pursuit is a standard statistical technique for data analysis [IZ],[l6], [22]. In this ar- 
ticle we generalize projection pursuit in a probabilistic framework similar to the one proposed by 
Beckmann et. al. {[2]). We consider data generation at n points via a noisy mixing process as 
follows: 



x = fj, + As + ri, i = l,2,...,n (1) 

where x £ R^, s £ R"? and A £ R-^^^, r/ ~ N(0, a'^V). We assume that p > q to achieve a compact 
representation of the observed data. 

The problem is to estimate automatically q, A, s and fx given observations Xi,i = 1 . . .n. This 
problem is also called a blind source separation (BSS) problem since q, A and s are all unknown. 
The inclusion of noise term r] makes the problem into a probabilistic one. 



First consider the case when Vi = Ip for all i. Section 5.1 shows how to handle the case Vi ^ 
Ip,\/i. 

If Ip = [1,1,..., 1]"^ G R^ be a vector of ones. Then 
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ljx = lj/i + lj^s + ljry,i = l,2,...,n (2) 

Let 

D = D- IplpD, where D = x,A,fi,r) (3) 

Then we can write 

X = fl + As + fj (4) 

where ~ N ^0, cr'^{Ip — IplJ/p)^ Since the scahng of A and s is arbitrary we assume without loss 
of generahty that 

E(s) = and Cov(s) = I, (5) 

No other assumptions are made about the joint source density p{s) other than the ones injsj From 
equations |4] and [5} 

A = E{x) (6) 

E[{x -Ji){x- Uf] = AA^ + a\lp - Ipll/p) (7) 
Using the law of large numbers (LLN) we can approximate the covariance matrix as: 

1 

E[{x -i2){x- flf] « - J2{xi - j2){xi - flf = U^U^ (8) 

1=1 

In [s] S = diag{\k) with \k-,k = the singular values and U a matrix containing the 

corresponding singular vectors of the covariance matrix. The estimate of /x is easily obtained from 

El 

1 " 

fL=-^Xi (9) 

1=1 

Without knowing the true source densities p{s) it is not possible to estimate the maximum likelihood 
estimates of A and o"^. However one can find estimates that try to satisfy the second order condition 
[7] as closely as possible by minimizing the Frobenius norm: 



i, a2 = argmin II - V(Si - fi){xi - fif - AA^ - a'^{Ip - lpl^/p)\\l 
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It is easily shown that if Uq is a p x q submatrix of U containing the first q singular vectors 
corresponding to the q largest singular values and T,q is the q x q submatrix of S then the solution 



to 10 is given by: 



and 

p-i 



a2 



^ 1 E (11) 



P-1 - i=g+l 



where Q is an arbitrary q x q orthogonal matrix {Q^Q = Iq). Given A and cr^, the least squares 
estimate of Si G R'^ are given by: 



= {A^A)-^A^{xi - A) = - a^Iq)-^'^U^{xi -fi) = Qii (12) 

where 

Xi = {J:q-a^Iq)-'/^U^{x,-fi) (13) 

3 Problems solved in Projection Pursuit 

In Projection Pursuit (PP), we parameterize the orthogonal matrix Q as 

Q = [WI,W2 ■ . . ,Wq]'^ (14) 

where Wi G R''. The kth PP projection is defined as for each point i: 

Ski = w'^Xi, i = 1 . . .n and k = 1 . . .q (15) 

In vector form we can write the kth projection as: 

= w^x (16) 

where s'' = [ski, • • • , is a 1 x n vector and x = [xi, X2, ■ ■ ■ , Xn] is a q x n matrix. We define 
an objective function f{s^, s^, . . . , s'^) and optimize for the projection vectors. Mathematically we 
solve the optimization problem: 

[wl , . . . , w*] = argmax f{w^x, . . . , vo^x) + b{w^x, . . . , w^x; Q) 

W-i,...,Wq 



3. 1 Separability 
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Here b is function accounting for supplementary information that we want to include in the op- 
timization problem. Q is all supplementary information of interest to the optimization problem. 
For example, in spatial problems Q could be a spatial location of points and g could be a function 
accounting for spatial smoothness (such as a Markov random field). Many such problem specific 
functions can be proposed based on user objectives. There could also be additional user defined 
equality and inequality constraints, for example: 

Ci{w^x) = 0, i = 1 . . .m, ,k = 1 . . . q (17) 



Qiiw^x) >0, i = 1 . . . L, ,fc = l 



(18) 



Thus in general, we get a constrained projection pursuit problem. Constraints 17 and 18 can be 
arbitrary non-linear constraints not necessarily parameterized by wTx. 



3.1 Separability 



The optimization problem in 17 must involve a joint optimization of the vectors wi^ . . . ,Wq in 
general. In many important practical cases (such as for example when / is the joint negentropy 
index) the objective function / has a separable structure such that 
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f{wJx,W2X,...,WgX) = ^h{wlx) (19) 

fc=i 

for some function h, then the optimization can proceed sequentially where at each stage we 
solve 

wl = argmin ^^h{wlx) (20) 
At each stage is a unit vector orthogonal to the previously calculated vectors i.e 

*T * jo when I < k , . 

^k^l=\ ^ if l = k (^^^ 



3.2 Multistage Optimization stragegy 

In this section we describe a 2 or 3 stage strategy which we have found via experience to converge to 
good "local" solutions. It is well known that if an optimization problem that has multiple optima 
then the "local" solution found by an algorithm is strongly dependent on how the algorithm is 
initialized. For applications such as BSS, even though an algorithm finds a "local" solution as 
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indicated by convergence diagnostics, it may not be a "global" solution. In order to increase our 
chances of finding a solution that is global, we propose a random sampling strategy for initialilzation 
of an optimization algorithm first. 

3.2.1 Stage 0: Search for good seed points 

For concreteness, suppose tt;*, • • • , are the optimal solutions found previously and suppose 
we are tying to find w^. Let W he a q x {q — k + 1) matrix that is the orthogonal complement of 
[wl,W2, • • • , as determined by say Gram-Schmidt orthogonalization. Then 

W^wt = 0, l = l,2,...,k-l (22) 

1. Generate Ug vectors in R'' whose elements are drawn from a uniform distribution on (—1,1). 

2. Standardize each vector to have unit norm to get the set of vectors Z = [zi, Z2, ■ ■ ■ , Zn^] where 
each Zi G R'^. 

For each Zi we define seed points as 

= Wzi (23) 

It is easy to see that ufwi = 0,/ = l,2,...,A; — 1 and ufui = 1. Thus Ui satisfies the constraints in 
We then compute the objective function h{w'^x) at each of these points Ui,i = 1,2, ... ,ns and 
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choose R points ti,. . . ,tji that give the highest objective function values as candidate seed points 
for the next step. 

3.2.2 Stage 1: Computing local optimum at R best points from Stage 1 
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In this stage, we compute the local solutions w^}- , w'^'^ , . . . , w^^ of the optimization problem 
starting from ti, . . . ,tji and choose the best local solution that has the highest function value. 

wl. = argmax hk{w*j^),i = 1,2, . . . , R (24) 

ADIS uses Ug = 1000 and R = 2 as the defaults. 

3.2.3 Stage 2: Joint optimization with initialization from Stage 2 

After Stage 1 has been applied from k = l,...,q we have an estimate of the solution vectors 



vj^jk = 1,2, . . . ,q. In this stage we solve the joint optimization problem 19 subject to the single 
joint constraint: 



9 q 



Y^Y.(^fujj-5ij)' = (25) 

1=1 j=i 
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where 6ij = 1 ii i = j and otherwise. We initiahze the algorithm with the solution wl.,k = 1, . . . ,q 
from Stage 1. The algorithm converges to a local joint solution only in a few iterations. 

4 Optimization Algorithm 

For flexible and powerful BSS algorithms, a primary requirement is a fast and robust optimization 
algorithm that can handle non-linear user defined constraints as well as handle non-convexity in 
the objective function or the constraints. Furthermore, extensive convergence diagnostics should 
be a standard output of the optimization process to ensure convergence to a local solution. The 
optimization core in ADIS (coded in MATLAB, www.mathworks.com) uses a modified augmented 
lagrangian algorithm (inspired by the implementation in LANCELOT package |T0], [12]) to solve 
equality constrained problems generated in constrained non-convex BSS problems. Inequality con- 
straints are handled by first transforming them to equality constraints via slack variables and 
solving the resulting bound constrained optimization problem. Some features of interest are as 
follows: 

1. A Trust region based approach [30] is used to generate search directions at each step (for 
both equality constrained and inequality constrained problems). 

2. For equality constraints only, the subproblems above are solved using a conjugate gradient 
approach (Newton-CG -Steihaug) [36] that is fast and accurate even for large problems and 
can handle both positive definite and indefinite hessian approximations. If both equality 
and inequality constraints are present then we solve the trust region problem with a non- 
linear gradient projection technique [S] followed by subspace optimization using Newton- 
CG-Steihaug. Our algorithm allows for infeasible iterates i.e., those that do not satisfy the 
problem constraints during optimization. This allows for a fuller exploration of parameter 
space and increases the likelihood of converging to a global optimum. 

3. A symmetric rank 1 (SRI) quasi-Newton approximation to the hessian [Tl] is used which is 
known to generate good hessian approximations for both convex and non-convex problems. As 
suggested in [33 we do the update also on the rejected steps to gather curvature information 
about the function. We provide options for BFGS |4j especially for convex problems and an 
option for preconditioning the CG iterations. We also implement limited memory variants of 
SRI and BFGS for large problems. 

4. Our algorithm accepts vectorized constraints so that multiple constraints can be programmed 
simultaneously. Only gradient information is required. Hessian information is optional but 
not required. Optionally, it is easy to interface our code with INTLAB software package [33] 
for automatic differentiation in which case the user only codes the function and constraints 
and the gradients/hessians are generated automatically. 



4.1 Convergence Diagnostics 
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We tested the performance of our algorithm using standard optimization benchmarks from the 
GAMS performance benchmark problems (http://www.gainsworld.org/perforiiiaiice [E]). The 



appendix shows some sample benchmarks as well as provides more technical details of the algo- 
rithm. 



4.1 Convergence Diagnostics 

It is critical to verify that the optimization algorithm has found a local solution by checking con- 
vergence diagnostics. These diagnostics help us determine if the Karush-Kuhn- Tucker (KKT) [33] 
necessary conditions for optimality have been satisfied or not. Profile plots are plots of a diagnostic 
measure versus iteration number. We propose the following checks for all BSS algorithms: 

1. Convergence to a point satisfying necessary conditions for optimality can be accessed by 
looking at profile plots for 

• Objective function 

• Optimality error (such as "gradient of the lagrangian" for equality constraints or "KKT 
optimality checks" for general constraints) 

• Feasibility error (checking constraint satisfaction) 

• Lagrange multipliers 

• Other parameters in the algorithm (such as a barrier or penalty parameter) 

The user should at the very least check these diagnostic plots to make sure convergence 
is attained. If possible the second order sufficient conditions for optimality should also be 
verified at the solution point using Hessian information. 

2. The algorithm used for optimization should flag an error and stop running in case con- 
vergence is not attained at any intermediate stage. This prevents the user from getting access 
to incorrect results. 

These convergence diagnostics are a standard feature of ADIS. Any solution returned by ADIS is 
guaranteed to be optimal. 



5 Statistics on estimated sources 



In this section we develop equations that enable us to apply ADIS to a real data-set and make 
inferences from extracted sources. 



5.1 Correcting for Autocorrelation 
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Once Si are estimated we can compute their variance using the GLM estimate 



(26) 



where af is the estimated variance at point i 




{xi- ji- Asi)'^{xi - II- Asi) 



(27) 



p-q 



We can create maps of contrasts of interest using the above equations. We also estimate the relative 
variance (RV) contribution at point i using the component k as follows: 



Inspection of these voxelwise variance explained maps is very useful in searching through the esti- 
mated sources for application based relevance. 

5.1 Correcting for Autocorrelation 

Extending to the case of autocorrelated noise is straightforward. When Vi ^ I then we proceed in 
an iterative fashion as follows: 

1. Set Vi = I and estimate A, s and compute the pointwise residual 



2. Compute autocorrelation in fj and prewhiten the data Xi using the estimated correlation 
matrix to get x^*". Run the PP algorithm on x^^ until x^^ does not change from one iteration 
to the next in an average sense. Various prewhitening schemes such as AR(p) models or non- 
parametric approaches can be used. ADIS uses the non-parametric approach proposed in 
|38| to do prewhitening. By default ADIS will not do iterative prewhitening unless explicitly 
specified by the user. 



If j4 = [oi, 02, . . • , a<j] then 




(28) 



fi — Xi /X ASi 



(29) 
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6 Latent dimensionality estimation 

Sophisticated bayesian strategies exist for estimating the latent dimensionality in the case of Gaus- 
sian sources [29]. In this section we develop a latent dimensionality estimation procedure that 
works very well both with Gaussian and non-Gaussian sources using a bootstrap/cross- validation 
procedure. 

The latent dimensionality q is estimated in two steps. We first estimate a lower bound on the 



latent dimensionality (6.1) followed by a cross validation analysis to refine the lower bound (6.2). 



In subsection |6.3| we validate this approach via extensive numerical simulations. 

6.1 Stage 1 - Estimating the lower bound 

Let 



and suppose Ai > A2 > • • • Ap are the p eigenvalues of XX'^ /n. 



X = [Xl - fl, X2 - fJ., ■ ■ . , Xn - l4 (30) 

p are 

1. Randomly permute each column of the p x n matrix X to get the matrix X''. 

2. Compute the p eigenvalues A,^ of X^X^^/n such that AJ > A| > . . . A^ 

3. Choose qi to be the largest value of i such that Aj > A^'. 

First we destroy the systematic correlations between the columns of X via random permutations of 
each column. Then we estimate the singular values of the permuted covariance matrix and compare 
these with the singular values of the unpermuted covariance matrix. Only those singular values 
that exceed the ones from random permutation are deemed significant and the lower bound on 
latent dimensionality qi is estimated to be the cardinality of those singular values. 

If Pi G RP^P are permutation matrices then we can write: 



Suppose 



1 

x' = -J2 - - fifpF (31) 



A = Ua^aVj^ (32) 



is the singular value decomposition of A with = diag(fTa-) 

Let q be the true latent dimensionality. Then for large n it can be shown (and verified by simulation) 
that: 

cr^. -I- CT^ if i < q 

Xi= { if q+l<i<{p-l) (33) 

if i = p 



6.2 Stage 2 - Cross Validation 



13 



and 

* 1 if z = p 



Thus the non-zero eigenvalues of X" satisfy > cr"^, the noise variance. Hence the largest index i 



such that Aj > A^ is a lower bound for the latent dimensionality q. 



6.2 Stage 2 - Cross Validation 

Suppose the true latent dimensionality is assumed to be q. Then for large n the eigenvalues Aj will 



follow equation (33) i.e, the eigenvalues {Aj, i = {q + 1) . . . {p — 1)} should be well approximated by 
a constant o"^ . We estimate the quality of this model using leave one out cross validation |20j . Let 
Mg'' be the mean of the eigenvalues Ai from q + 1 to p — 1 excluding the index k. 

^^" = JZ^, E A. (35) 

The leave one out cross validation error assuming the true latent dimensionality to be q at point k 
is given by: 

E{q,k) = (\k-Mg'')\k = q + l,...,p-l (36) 

The mean cross validation error and its variance can be estimated from these pointwise values as 
follows: 

m = — i— X: E[q, k) (37) 

P ^ q k=q+l 

\aiiE{q)) = ] \ai{E{q, k),k = q + 1, . . . ,p - 1} (38) 

p — 1 — q 

When q is smaller than qtme then both E{q) and Yai{E(q)) will be large and when q is greater than 
qtrue then both E(q) and Var(£'(g)) will be small. Since the eigenvalues are expected to remain 
constant beyond qtrue the change in E{q) will be small beyond qtme- We define the following index 
for a given value of q quantifying the change in cross validation error from q to q + 1. 

m = , (39) 

Vav{E{q)) +Yav{E{q + 1)) 



6.3 Validation of the approach 
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If qtrue is the true latent dimensionality then A{q) will tend to have a maximum at qtme — 1 since 
this is the point which will show the largest change in cross validation error in going from qtme — 1 
to qtrue- We thus propose the estimate 



qtrue = 1 + argmax A(5), g = qi, . . . ,p - 4 (40) 
g 

where qi is a lower bound on q calculated from stage 1. 

The estimation of E{q) uses a smaller number of points when q gets very close to p — 1. Thus 
the estimate A(g) becomes unstable when q is within a few time points of p — 1. In order to 
robustify our estimate against this instability we define the cumulative maximum index function 
which calculates the index of maximum of A{q) from q = qi, . . . ,r. 



/(r) = argmaxA(g),g = qz, . . . ,r (41) 

Then we count the number of times that a maximum is detected at y 

g{y) = Card{r : /(r) = y} (42) 
and define the estimate of dimensionality to be 

qtrue = 1 + argmax5((y) (43) 

y 



6.3 Validation of the approach 

To test this latent dimensionality algorithm we generate data as per equation ([T|. Details of the 
simulation are as follows: 

1. The true mixing matrix A was chosen to he a p x q matrix where the elements were drawn 
from a uniform distribution in (0,1). This random matrix was then scaled so that its minimum 
singular value amin{^) = 1- 

2. The sources s in ([T]) were generated from three different types of distributions based on their 
kurtosis "excess" values 7 . The chosen distributions were Gaussian (7 = 0), Uniform (7 < 0) 
and Gamma (7 > 0). 

3. The ratio of noise standard deviation in 
was varied between 0.75, 1, . . . , 2. 

4. The ratio of the true latent dimensionality to the dimensionality of the observed data 
was varied between 0.1,0.2,. . . ,0.5. 



), fj to the minimum singular value of A, 
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5. This process was repeated 20 times for each combination of source type, ratio and 

2i=^ ratio. 
p 

6. For each individual simulation we estimated the latent dimensionality q based on the 2 stage 
estimation strategy described above. 

We chose p = 50 and n = 1000 as fixed parameters of the simulation. Simulations show that this 2 
stage estimation is almost unbiased for both Gaussian and non-G aussi a n emb edded sources for all 



values of the ratios 



and 



qtr 



Results are shown in figure 1(a) - 1(c) 



7 Benchmarking: Comparison with other BSS algorithms 



Choosing the negative entropy index as our objective function and without imposing any additional 
constraints, our algorithm attempts to estimate sources that are independent. To test and compare 
our algorithm with others, we used an approximation to negative entropy as proposed in |26J (see 
appendix for details). 



ICALAB [8], [7] (available from http: //www. bsp .brain. riken. go . jp/ICALAB/ 1 is a Matlab pack- 



age for comparing algorithms for BSS. We used ICALAB to compare the performance of our algo- 
rithm with other standard BSS algorithms such as FJADE [6j, FPICA [26^, [24, EFICA [27], [28] 
, ERICA [13] and UNICA [TJ] which use higher order statistics to separate sources. 

The quality of source extraction is measured using the Source to Interferences Ratio (SIR) [37] 
(of the estimated mixing matrix) which measures the ratio of the energy of the estimated source 
projected onto the true source to the energy of the estimated source projected onto the other 
sources. Higher values of (SIR) indicate better performance. Please see the appendix for details. 
ICALAB also comes with standard benchmarking datasets ( ^http : / / www . bsp . brain . riken . jp7] 
ICALAB/ ICALABSignalProc/p. 



A Monte Carlo analysis was performed using ICALAB by generating uniformly distributed random 
matrices Ai and creating a mixed source data-set Xi for a given set of sources S. 

X, = AiS, i = l,2,...,nb (44) 



To get a baseline measure of performance for each algorithm, we use square mixing without addi- 
tional noise for the simulation. Each algorithm was then run on this mixed data set. This process 
was repeated Uh = 100 times for each dataset using a new mixing matrix every time. In ICALAB, 
most of the algorithms are given default parameters that are tuned optimum values for typical 
data. As suggested in ICALAB, we use these default algorithmic parameters for benchmarking 
purposes. 
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(d) Illustration of latent dimensionality estimation 



Figure 1: (a), (b) and (c) depict simulations showing the performance of latent dimensionality 
estimation procedure on various source types at different ratios parameterized by various 

q/p ratios, (d) Latent dimensionality estimation for a Gaussian sources with qtme = 35, p = 100, 



n = 1000 and 



0.75. A(g) attains a maximum for g = 34 and so g = 1 + 34 = 35 
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ADIS is able to perform non-square BSS in the presence of noise. However, since the other algo- 
rithms in our benchmarking test have not been designed to do this, we think its unfair to compare 
non-square ability of ADIS with other algorithms. 

The 13 benchmarking datasets and their short descriptions are given in the appendix. In order to 
evaluate the effect of different types of mixing matrices, we ran additional Monte Carlo simulations 
when A was chosen to be one of the following (a) Random sparse (b) Random bipolar (c) Sym- 
metric random (d) 111 conditioned random (e) Hilbert (f) Toeplitz (g) Hankel (h) Orthogonal (i) 
Nonnegative symmetric (j) Bipolar symmetric (k) Skew symmetric. 



The results are shown in figures 2(a) - 4(d) and table [TJ The key performance feature is the SIR 



index [37] (see appendix for definition), the higher the value the better. Another important feature 
is the variability of SIR over 100 mixtures each generated using a different mixing matrix but 
containing the same underlying sources. Ideally an algorithm should be robust enough to converge 
to the same solution regardless of variation in the mixing matrix. The standard deviation of SIR 
captures this variability, the lower the variability of SIR the better. Additional results showing 
simulation results with different types of mixing matrices A are shown in |7(a) - |9(d)[ 



ADIS perfomed better than all other algorithms in almost all cases both in terms of the mean 
SIR index as well as the standard deviation of the mean SIR, which measures the robustness and 
convergence to the same solution. ADIS also produces extensive convergence diagnostics a sample 



of which is shown in figure 6(a) These diagnostics guarantee convergence and improve confidence 
in the estimated sources. 

An illustration of performance improvement using ADIS (Stage 1-1-2) over Stage 1 is shown fir the 
ACsparselO dataset in figure [5(b)| The corresponding convergence diagnostics are shown in figure 
|6(b)[ ADIS Stage 2 performs better than ADIS Stage 1 but we did not observe as dramatic an 
improvement as seen for ACsparselO. Other ICA algorithms were outperformed using only ADIS 
Stage 1. 

All experiments were performed on a computer with an Intel Xeon (TM) processor (3.4 GHz) and 
4GB of RAM. The runtimes of ADIS (Stage 1) were comparable to those of FPICA, ERICA and 
UNICA. We found EFICA and JADE to be faster than other algorithms in general. 



8 Case Study using real fMRI data 
8.1 fMRI case study 

To demonstrate how ADIS performs on a real dataset, we used the "FSL Evaluation and Example 
Data Suite" (FEEDS) from FMRIB Image Analysis Group, Oxford University. The URL for this 
data suite is: http://www.fmrib.ox.ac.uk/fsl/feeds/doc/index.htnil One of the datasets in 
the example suite contains an audio visual experiment with two explanatory variables, the visual 
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Parameters of the statistics: 
Mean = 1 2.6895 6,1 7633 7.58203 2.06524 2,02682 1 6.1 839 [dB], 
Std = 4,9292e-07 1,2181 1.5404 0,28736 0.24467 0.00085621 [dB] 
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luNICA 
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Parameters of the statistics: 
iviean = 1 0.1 425 1 5.61 1 6 1 5.3378 8.361 55 8.34187 1 7.3485 [dB], 
Sid = 1 ,1 592e-07 1 ,2782 1 .0605 0,371 94 0.73508 0,86397 [dB] 
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Parameters of the statistics: 
lUlean = 7.1563 4.9042 6.9492 4.6043 4.6137 9.8083 |dBl. 
Std- 1.375e-07 0.86771 1 5722 2.0459 2 0033 0.0001421 [dB] 



Parameters ot the statistics: 
Mean. 11.1837 16.4712 13.8151 7.43281 7,42478 17.3276[dB], 
Std -4.7971 B-07 0.69983 0.90194 0.51499 0.45778 0.28618 [dB] 





Mean SIRS torA[dB] 



Mean SIRS torA[dB] 



(c) GnBand 



(d) acspeechlB 



Figure 2: Histograms of mean SIR for each algorithm over 100 Monte Carlo simulations using 
randomly generated mixing matrices for various benchmarking datasets. 
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Parameters o1 the statistics: 
Mean = 23.4584 25.7143 23.1289 8.67669 8.67091 25,6381 [dB], 
S1d = 4,9851e-07 4,6236 0.13626 0,021637 0,0012764 4.2696e-05 [dB] 
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Parameters of the statistics 
Mean - 63,4168 3.23271 8.40884 4.2814 4.61709 15.5156 |dB], 
Std - 0.00065331 3.8223 3,3036 1.6977 1,7091 0.69494 [dB] 
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Parameters of the statistics: 
IVIean = 0.82866 9.5245 1.2515 0.83222 0.75928 9.8824 [dB], 
Std = 0.13961 0.53611 0.47259 0.25704 0.20297 0.3263 [dB] 
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Parameters o1 the statistics: 
lUlean - 19,6799 24.2719 13,2701 20.1706 20,1546 25.0599 ]dB], 
Std - 4 76566-07 0.44698 0.56654 0.20558 0.08175 0.25144 ]dB] 
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Figure 3: Histograms of mean SIR for each algorithm over 100 Monte Carlo simulations using 
randomly generated mixing matrices for various benchmarking datasets. 
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Parameters of the statistics; 
Mean = 1 .231 6 2.7579 2,4724 1 .3334 1 .301 9 3.5866 [dB], 
Std- 1,8675e-07 1,0922 0.001836 0,66476 0.68447 0,30894 [dB] 
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Parameters of the statistics; 
Mean = 8.95666 6.44917 6.78942 5.36836 5.72345 10.1397 [dB], 
Std- 2,6941e-07 2,4783 0.2247 0.99503 0,78989 2,6747e-06 [dB] 
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Parameters o1 the statistics: 
Mean - 9,18377 15,4489 8,77194 3,73458 3,6949 15,7211 [dB], 
Std = 1 .4601 e-07 0.65632 0.99227 0.7741 1 0.7441 9 0.381 98 [dB] 




Parameters of the statistics: 
Mean - 1 5,4765 20,581 3 1 8,8432 7,95041 7,94034 21 ,0954 [dB], 
Std = 1 .0352e-06 0.88785 0.096563 0.1 1 471 0.08401 3 0.1 1 986 [dB] 
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Figure 4: Histograms of mean SIR for each algorithm over 100 Monte Carlo simulations using 
randomly generated mixing matrices for various benchmarking datasets. 
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Algorithm, M = SIR [dB] and S = std{SIR) [dB] 
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GnBand 
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0.90194 


0.51499 
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M 


23.4584 


25.7143 


23.1289 


8.67669 


8.67091 


25.6381 
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4.9851e-07 


4.6236 


0.13626 


0.021637 
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4.2696e-05 


ACsparselO 


M 


63.4168 


3.23271 


8.40884 


4.2814 


4.61709 


15.5156 


S 


6.5331e-04 


3.8223 


3.3036 


1.6977 


1.7091 


0.69494 


25SpeakersHALO 


M 


0.82866 


9.5245 


1.2515 


0.83222 


0.75928 


9.8824 
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0.13961 


0.53611 
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2.4724 


1.3334 
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3.5866 
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1.8675e-07 


1.0922 
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0.66476 


0.68447 


0.30894 


X5smootli 


M 


8.95666 


6.44917 


6.78942 


5.36836 


5.72345 


10.1397 


S 


2.6941e-07 


2.4783 


0.2247 


0.99503 


0.78989 


2.6747e-06 
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Table 1: Mean SIR (M) and its standard deviation (S) for various benchmark datasets over 100 
Monte Carlo simulations for different algorithms. The benchmark datasets are a part of ICALAB 
[8 , [7j. An 'x' means that the algorithm failed to converge. The algorithms were ranked based on 
not only their mean SIR (M) and standard deviation (S) but also the entire SIR histogram. The 
color red is used to mark the best performing algorithm and the color blue is used to mark the 2nd 
best. In cases where more than one algorithm is marked with the same color, both algorithms were 
judged to perform equally well. 
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Parameters of the statistics: 
IVlean = 5.9408 3.4533 0.41501 0.43314 6.3234 |dB], 
Std = 0.20937 0.13021 0.10444 0.11409 0.1699 [dB] 




IVlean SIRsfor A [dB] 



(a) 64sounds mean SIR 



Parameters of the statistics: 
Mean = 63.4168 2.9502 8.55096 4.86705 4.61209 15.3346 33.6339 [dB], 
Std = 0.0006503 3.7029 3.1703 1.6424 1.352 0.68023 0.00056478 [dB] 
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IaDIS Stage 1 
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Mean SIRsfor A [dB] 

(b) ACsparselO ADIS Stage 1+2 

Figure 5: Histograms of mean SIR for each algorithm over 100 Monte Carlo simulations using 
randomly generated mixing matrices on (a) the 64sounds benchmark dataset. This is a relatively 
large dataset with 64 sources. FJADE fails on this test case, (b) the ACsparselO benchmarking 
dataset. This figure illustrates the improvement in SIR using the joint optimization of ADIS Stage 
1+2 over ADIS Stage 1 
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(b) 2 stage convergence diagnostics 



Figure 6: (a) Sample Convergence Diagnostic plot for the lOhalo benchmark in estimating the 2nd 
component. Figure shows the evolution of objective function, the Lagrangian, norm of the gradient 
of the Lagrangian, norm of the constraint satisfaction error, norm of the Lagrange multipliers and 
penalty parameter over algorithm iterations. ADIS guarantees the optimality of the estimated 
sources, (b) Convergence diagnostics for ADIS Stage 1+2 for ACsparselO dataset. The joint 
optimization was initialized using the optimal solution from ADIS Stage 1. 
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Parameters of the statistics: Mean-12,E895 G.21E33 7.78543 2.08E17 2.01276 lG.184E[dB|, 

Std = "I.SSOeE^O? 1.028B 1.3237 0.48718 0.22996 0.00089893 [dB] 



Parameters Of the Statistics: Mean- 10,1425 15.555G 15.3149 8,60472 8.48549 17.3747[dB|, 

Std = 1.302e^07 1.5373 1.0755 0.70106 1.0537 0.82536 [dB] 
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Figure 7: Histograms of mean SIR for each algorithm over 100 Monte Carlo simulations using 
randomly generated mixing matrices for various benchmarking datasets. The properties of mixing 
matrices A used for the simulations are as follows: (a) Random sparse (b) Random sparse (c) 
Random bipolar (d) Symmetric random 
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Paramelera of the statistics; Mean - 23.4587 2G,4595 23.1529 10,847 13.244G 25.G374 [dB], 

Std = 0.005545 5.0283 0.14474 0.026423 3.1539 0.010756 [dB] 
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Parameters of 1he stBtistica: Mean - 5.5B08 0.GB843 0,30336 9.G574 7.3B87[dB]. 

Std = 0.51616 0.61537 6.6949e-16 1,2497e^14 0,040372 [dB] 
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Std = 3.2136e-14 0.50457 0.60872 3.5706e-15 0.27804 [dB] 
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Figure 8: Histograms of mean SIR for each algorithm over 100 Monte Carlo simulations using 
randomly generated mixing matrices for various benchmarking datasets. The properties of mixing 
matrices A used for the simulations are as follows: (a) 111 conditioned random (b) Hilbert (c) 
Toephtz (d) Hankel 
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Parameters o1 the statistics: MeBn-1.231G 2,7217 2.4725 1,3509 1.3239 3.5944[dB]. 

Std = 6.5422e-15 1.082 0.0019508 0.S5467 0.75843 0.32303 [dB] 
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Std = 1.2356e-07 0.69526 0.99738 77178 0.83914 0.33677 [dB] 
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Parameters of the statistics: Mean = 15.4765 20.7321 18 8506 7.95125 7.93223 21.118? [dB], 
Std = 9.7645e-07 0.84865 0.11657 0.197 0.0019804 0.13309 [dB] 
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Figure 9: Histograms of mean SIR for each algorithm over 100 Monte Carlo simulations using 
randomly generated mixing matrices for various benchmarking datasets. The properties of mixing 
matrices A used for the simulations are as follows: (a) Orthogonal (b) Nonnegative symmetric (c) 
Random bipolar (d) Skew symmetric 
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Parameters of the statistics: l^ean = 5.9518 3.4437 0.44428 0.80297 6.3131 [dB], 
3td = 0.21071 0.15381 0.12804 0.27411 0.16504[dB] 
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Figure 10: Histograms of mean SIR for each algorithm over 100 Monte Carlo simulations using 
randomly generated mixing matrices on the 64sounds benchmark dataset. This is a relatively large 
dataset with 64 sources. FJADE fails on this test case. A was chosen to be an ill conditioned 
random mixing matrix for this simulation. 
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stimulus (30s off, 30s on) and an auditory stimulus (45s off, 45s on) . Analysis was carried out using 
FEAT (FMRI Expert Analysis Tool) Version 5.4, part of FSL (FMRIB's Software Library). 

www . f mrib . ox . ac . uk/f si 

The following pre-statistics processing was applied; motion correction using MCFLIRT [Jenkinson 
2002]; non-brain removal using BET [Smith 2002]; spatial smoothing using a Gaussian kernel of 
FWHM 5mm; mean-based intensity normalisation of all volumes by the same factor; highpass 
temporal filtering (Gaussian- weighted LSF straight line fitting, with sigma=50.0s). 

ADIS estimated a latent dimensionality of g = 34. The source components activating the auditory 
and visual cortex were identified by inspecting the estimated voxelwise variance explained map. 



The results are shown in figures [TT 12 
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Figure 11: Latent dimensionality estimation summary for fMRI data. The number of latent sources 
were estimated to be g = 34. 
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(a) Estimated Auditory Cortex activating 
source ^-statistic for the audio-visual fMRI data 
thresholded at z > 5 
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(c) Convergence Diagnostic plot for the Auditory source esti- (d) Convergence Diagnostic plot for the Visual source estima- 
mation tion 




(e) Associated Timecourse (from mixing ma- 
trix) for the Auditory source 
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(f) Associated Timecourse (from mixing ma- 
trix) for the Visual source 



Figure 12: Result of applying ADIS (Stage 1) to audio-visual fMRI data 
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9 Conclusion 

We implemented ADIS, an algorithm for probabilistic, constrained and non-square projection pur- 
suit. We validated all aspects of ADIS including the latent dimensionality estimation procedure and 
its optimization core. When compared to other algorithms using standard benchmarking datasets 
using ICALAB, we find our algorithm outperforms other standard algorithms such as FastICA, 
FPICA, JADE, ERICA and UNICA in terms of both robustness and separation quality. Our al- 
gorithm also guarantees "optimality" for each blind source via extensive convergence diagnostics 
and enables the user to use arbitrary contrast function and constraints for BSS. We hope it will be 
useful as a general BSS tool for the signal processing and fMRI community. 

10 Appendix 

11 Negentropy Index 

Given a random variable X, the negative entropy a measure of non-Gaussianity. It is easy to 
show that imposition of independence on sources in BSS is equivalent to maximization of negen- 
tropy. 

Robust approximations to negative entropy were developed in [23]. If G is a non-quadratic, non- 
linear function and f is a Gaussian random variable of the same variance as X then the negentropy 
measure J[X) is given as ^26j 



J{X) oc [E{G{X)) - E{G{v)f 



(45) 



In this paper, we used the following function [23J for G: 



G{x) = log [cosh (x)] 



(46) 



where cosh (x) is the hyperbolic cosine function 



cosh (x) 



-I- e 
2~ 



— X 



(47) 



12 Sources to Interferences Ratio (SIR) 



The SIR ratio is defined in [37 . We give here a brief summary of the key equations from that 
paper. Let y = {yi,y2, ■ ■ ■ ,yk} and let Py be the orthogonal projector onto the subspace spanned 
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by yi, 2/2i • • • yfc- If s = [si, S2, ■ ■ ■ , Sk] are the true sources and ii si, S2, ■ ■ ■ , are the corresponding 
estimated values then define: 

^target — Psj C'^^) 

Winter f — ^s^j PsjSj ('^"^) 

The purity of source separation is measured using the SIR performance index defined as fol- 
lows: 

13 Benchmarking datasets 

The 13 benchmarking datasets and their short descriptions are as follows: 



(http : //www . bsp . brain . riken . j p/ICALAB/ICALABSignalProc/ 1 : 



• nband5 - contains 5 narrow band sources. This is a rather "easy" benchmark for second order 
separation algorithms but apprently presents challenges for higher order algorithms. 

• lOhalo - contains 10 speech signals that are highly correlated (all 10 speakers say the same 
sentence) . 

• GnBand - contains 5 fourth order colored sources with a distribution close to Gaussian. This 
is a rather "difficult" benchmark. 

• acspeechl6 - contains 16 typical speech signals which have a temporal structure but are not 
precisely independent 

• ABio5 - contains 5 typical biological sources 

• ACsparselO - contains 10 sparse (smooth bell-shape) sources that are approximately inde- 
pendent. The SOS blind source separation algorithms fail to separate such sources. 

• 25SpeakersHALO - 25 highly correlated speech signals 

• VsparserandlO - very sparse random signals 

• ACsincposlO - positive sparse signals 

• X5smooth - smooth signals 

• Speech20 - 20 speech/music sources 

• XlOrandsparse - random sparse signals 

• 64soundsstd - a variety of sound sources (64 in total) 
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14 Details on Optimization Algorithm 

Our optimization algorithm solves the general problem: 

min ^f{x) (51) 
s.t. Ci{x) = 0, i = 1, 2, . . . , m (52) 
s.t.g,{x)>0, j = l,2,...L (53) 

where x £ R^. 

We convert the inequality constraints into equality constraints via slack variables as follows: 

gj{x) - = (54) 
>0, j = l,2,...L (55) 

Thus the optimization problem becomes: 

min /(x) (56) 

s.t. Ci(x) = 0, i = 1, 2, . . . , m (57) 

s.t. 5i(x) - = 0, j = l,2,...L (58) 

Sj > (59) 

This problem is now an equality constrained problem where the inequalities have been replaced 
by the bound constraints on the slack variables. Thus it suffices to consider equality constrained 
problems with bounds on independent variables as follows: 

min f{x) (60) 
s.t. Ci[x) = 0, i = 1, 2, . . . , m (61) 
s.t. h < Xi < Ui, i = 1,2, . . .n (62) 

where x £ R^. 

Our code uses a trust region based augmented lagrangian approach to solve these bound constrained 
problems following closely the LANCELOT software package [12], |10] . The augmented lagrangian 
function for the above problem is defined as: 

m m 

C{X, A, IJ) = f{x) - J2 ^iCi(x) + XI '^«(^)^ (63) 
i=l ^ i=\ 

At each outer iteration k, given current values of A'^ and we solve the subproblem: 
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mill £(x, A'^j/Ufc) (64) 
s.t. k < Xi < Ui (65) 



If P is the projection operator defined as 

[P{z,l,u)]i-- 



k 

Zi 
Ui 



if Zi < k 
if k < Zi < Ui 
if Zi > Ui 



(66) 



then the Karush-Kuhn- Tucker (KKT) optimaUty condition for 64 is given as |10j: 

X- P{x-VxC{x,X'',fj.k),l,u) = 



(67) 



The outer iteration code is given in Framework 1. Note that the penalty parameter fik is updated 
based on a feasibihty monitoring strategy that ahows for a decrease in /i^ if sufficient accuracy is 
not achieved in solving the subproblem |64[ 

At each inner iteration we form a quadratic approximation to the augmented lagrangian and ap- 
proximately solve the inequality constrained quadratic sub-problem: 



min P i^P^^lx^ix, A, /i)p + V^£(x, A, ^)'^p (68) 

s.t. li < Xi < Ui (69) 
s.t. IWloo < A (70) 

The inner iteration code uses non-linear gradient projection [5] followed by Newton-CG-Steihaug 
conjugate gradient iterations p6j. Quasi-Newton updates are performed using either SRI [11] (rec- 
ommended for non-convex functions) or BFGS [4j (recommended for convex functions). For very 
large problems, we switch to the limited memory variants [32 of these quasi-Newton approxima- 
tions. The algorithm details are given in Framework 2. The trust region update code is based on 
a standard progress monitoring strategy [33j and is given in Framework 3. 



15 Optimization Benchmarks 



The optimization core of ADIS has been tested on many benchmark problems from the GAMS li- 
brary at http : //www . gamsworld . org/perf ormance as well as benchmarks from MINOS [3T] . 
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Algorithm 1 Fl: Outer Iteration 

Require: Initial point Xinit, A°, jj^q, 9^ G (l,oo), 9i G (0,1) 

1: Choose tolerances Ty*^^ and ri*^ad- '^^^ default in ADIS is r/*^^ = r]*^ad ~ ■ 

2: 11 = flQ, r]con = 1/^*0 ^ Vgrad = V/^O 

3: for A; = 0,1,2,... do 

4: found = 

5: while found / 1 do 

6: Try to find Xk such that 

\\xk - P{xk - VxJO-{xk, A'', /Ufe), Z, n)||oo < r]grad via F2 using starting point as Xk-i- 

7: if above step is completed successfully then 

8: Set found = 1 

9: else 

10: Xk+^ = Afe 

11: = (^l^J'k 

12: Vcon = '^/iJ^k^ 

13: -qgrad = V/^fe 

14: end if 

15: end while 

16: if \\c{xk)\\oo < r]con then 

17: if ||c(xfe)||oo < and 

\\xk - P{xk - VxJCixk,X'',0),l,u)\\oo < vlrad ^^en 

18: Stop and return current solution x^- 

19: end if 

20: A'=+^ = Afc - Mfcc(xfc) 

21: /Xfe+i = Ilk 

22: 1]con — Vcon/ l^k+1 

23: f]grad ~ 'Hgrad/ 

24: else 

25: A'^+l = Xk 

26: Hk+l = OhfJ'k 

27: rjcon = ^/l^k^ 

28: rigrad ='^/lJ'k 

29: end if 

30: end for 
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Algorithm 2 F2: Inner Iteration 

Require: jmax, %rad, A, Z, u, A*^, /Xjt, 77 G (0, 1), /Za^ 

1: found = 

2: X = Xfc_i, J = 1 

3: Compute, 5 = Vx^{x, A'^, //fe) 

4: Estimate B = V'^^C{x, X'^, fik) using BFGS, SRI or limited memory BFGS, limited memory 

SRI quasi Newton Updates. 
5: while found ^ 1 and j < jmax do 
6: Calculate the Cauchy point pc for problem: 

min p ^p^Bp + g^p (71) 
s.t. I - X <p <u- X (72) 
s.t. IIpIIoo < A (73) 

using non-linear gradient projection and calculate the current active set A. Let be the 
unit vector with 1 at position i and zeros elsewhere. If zi,Z2,...% ^ A then let Q = 

^ii 1 612 ) • • • ) Cj^] . 
7: g = Q'^{g + Bp^) and B = CfBQ 
8: Compute the approximate solution v to the problem 

min y ^v^Bv + g'^v (74) 

s.t. I — X < pc + Qv < u — X (75) 
s.t. IIpc + Q^^IIoo < A (76) 

using truncated conjugate gradient iteration (Newton-CG, Steihaug). If flag = 1 use pre- 
conditioned Newton-CG using the inexact-modified Cholesky factorization. 

9: Compute p = Pc + Qv 

10: Calculate 6c = C{x) — C{x + p), 6m = 0.5p^Bp + g^p and p = 6c/ 6m 
11: if p > f] then 

12: X = X + p 

13: end if 

14: Compute new trust region radius A using Framework F3. 

15: Compute, g = VxC,{x, X^, Pk) H p > rj holds otherwise use the previous value. 

16: Estimate B = V^^£(x, X^,pk) using BFGS, SRI or limited memory BFGS, limited memory 

SRI quasi Newton Updates. Do the update even if p < 77. 

17: if ||x - P{x - VxJO-{x,X'',pk)J,u)\\oo < Vgrad then 
18: found = 1 

19: end if 

20: i = i + 1 

21: end while 
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Algorithm 3 F3:Trust Region Update 
Require: p, p, A 



1 


if ;9 > 0.75 then 


2 


if IIpIIoo < 0.8A the 


3 


A = A 


4 


else 


5 


A = 2A 


6 


end if 


7 


end if 


8 


if 0.1 < p < 0.75 then 


9 


A = A 


10 


else 


11 


A = 0.5A 


12 


end if 


13 


return A 



This section will describe some numerical experiments on interesting and difficult optimization 
benchmarks used to test the optimization core of ADIS. For these benchmarks, the gradient infor- 
mation was generated using automatic differentiation from the software package INTLAB |3l]. A 
limited memory variant of symmetric rank 1 (SRI) updating was used. The CG iterations were 
not preconditioned. The coiivGrgcncG tolerances ?7con 

and rjtoi were set to their default value of 

le-6. 

Electron 50 

Given Up electrons, find the equilibrium state distribution (of minimal Columb potential) of the elec- 
trons positioned on a conducting sphere. This problem is from C0PS3 [15j benchmark dataset. 

The problem is to find a configuration of low energy for a given set of point charges on a conducting 
sphere. It originated with Thomson's plum pudding model of the atomic nucleus and is represen- 
tative of an important class of problems in physics and chemistry that determine a structure with 
respect to atomic positions. Mathematically, the problem is: 




(77) 



Zi - Zj) ] 



21-0.5 



i=l j=i+l 



subject to 



xf + yf + zf = l,i = 



1 , . . . , Up 



(78) 
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The 150 variable problem for Up = 50 was taken from GAMS performance library (PrincetonLib 



(NLP)) at http://www.gamsworld.org/performaiice/priiicetoiilib/htm/fekete/fekete2.htiii 



The best known objective for this problem for Up = 50 is /* = 1055.1823. Our code attains this 



best objective in 13 outer iterations. See figure 13 for convergence diagnostics. 



Non-negative Least Square (NNLS) 

In NNLS we solve the problem: 

min^ f = \\Ax-b\\2 (79) 

subject to: 

Cx-d>0 (80) 
We solve the 300 variable optimization problem taken from GAMS performance library (Princeton- 



Lib (NLP)) at |http: //www.gamsworld. org/perf ormance/princetonlib/htm/imls/mils .htm 



The best known object for this problem is /* = 633785.4462. Our code attains this best objective 



in 26 outer iterations. See figure 14 for convergence diagnostics 



Largest Small PolyGon 

This is a classic problem also from C0PS3 [15j benchmark dataset. Given coordinates (rj,0j) of 
the rij] vertices of a polygon, we wish to solve the problem: 

riv — l 

max r,e f{r,9) = 0.5 ^ r^+irj sin(6'i+i - Oi) (81) 

i=l 

subject to: 

rf + r| - 2 nrj cos{0i - Oj) <l, I < i < n^, i < j < Uy (82) 

e^<6^+l,l <i <ny (83) 
(9i G [0,7r],ri > 0, 1 < i < (84) 

Some of the interesting features of this problem that make it difficult include the presence of the 
order of nonlinear nonconvex inequality constraints and the presence of 0{ny\) local minima. See 
|15] for more details. We solve the Uy = 6 problem taken from GAMS performance library (Prince- 



tonLib (NLP)) at http://www.gamsworld.org/performaiice/princetoniib/htm/polygon/pgoii. 
Etml 

The best known objective in GAMS library for this problem is /* = 0.5. 
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Figure 13: Electron 50 convergence diagnostics. The best objective of /* = 1055.1823 was attained 
in 13 outer iterations. Figure shows the evolution of objective function, the Lagrangian, norm of 
the gradient of the Lagrangian, norm of the constraint satisfaction error, norm of the Lagrange mul- 
tipliers and penalty parameter over algorithm iterations along with verification of KKT optimality 
conditions. 
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Figure 14: Non-negative least squares convergence diagnostics. The best objective of /* = 
633785.44 was attained in 26 outer iterations. Figure shows the evolution of objective function, the 
Lagrangian, norm of the gradient of the Lagrangian, norm of the constraint satisfaction error, norm 
of the Lagrange multipliers and penalty parameter over algorithm iterations along with verification 
of KKT optimality conditions. 
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Figure 15: Largest Small Polygon convergence diagnostics for = 6. Note that we first convert the 
problem into a minimization problem by multiplying the objective function with -1. The optimal 
objective value of —0.675 for the minimization problem was attained in 11 outer iterations. Figure 
shows the evolution of objective function, the Lagrangian, norm of the gradient of the Lagrangian, 
norm of the constraint satisfaction error, norm of the Lagrange multipliers and penalty parameter 
over algorithm iterations along with verification of KKT optimality conditions. 
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We first convert the maximization problem 81 into a minimization problem by multiplying the 
objective function by -1. We solve this problem and then evaluate the original objective [81] at the 
solution. We find that our algorithm attains an objective of / = 0.675 in 11 outer iterations. 

This is better than that reported by GAMS solvers. We were a little surprised by this observation. 
On researching this problem further we found that Graham [19j had solved this problem in 1975 and 
the best solution is indeed 0.675 (see http : //mathworld . wolfram . com/GrahamsBiggestLittleHexagon . 
|htnil[). On plotting the optimal hexagon estimated by our code alongside the solution from [19j . 



we find that they are identical. See figure 15 for convergence diagnostics. 
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